A comprehensive evaluation of multicategory classification methods for microbiomic data
Abstract
BACKGROUND: Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data.
RESULTS: In this work, we performed a systematic comparison of 18 major classification methods, 5 feature selection methods, and 2 accuracy metrics using 8 datasets spanning 1,802 human samples and various classification tasks: body site and subject classification and diagnosis.
CONCLUSIONS: We found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.
Related articles
A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis
MOTIVATION Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combina...
One-against-all multicategory classification via discrete support vector machines
Discrete support vector machines (DSVM), recently proposed in [10] and [11] for binary classification problems, have been shown to outperform other competing approaches on well-known benchmark datasets. Here we address their extension to multicategory classification, by developing a one-against-all framework in which a set of binary discrimination problems are solved by means of DSVM. Comput...
Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
Microarray gene expressions provide new opportunities for molecular classification of heterogeneous diseases. Although various reported classification schemes show impressive performance, most existing gene selection methods are suboptimal and are not well-matched to the unique charac... ©2010 Guoqiang Yu, Yuanjian Feng, David J. Miller, Jianhua Xuan, Eric P. Hoffman, Robert Clarke, Ben Davidson,...
Reinforced Multicategory Support Vector Machines
Support vector machines (SVMs) are one of the most popular machine learning methods for classification. Despite their great success, SVMs were originally designed for binary classification. Extensions to the multicategory case are important for general classification problems. In this article, we propose a new class of multicategory hinge loss functions, namely reinforced hinge loss functions. Both th...
Sparse partial least squares classification for high dimensional data.
Partial least squares (PLS) is a well-known dimension reduction method which has recently been adapted for high dimensional classification problems in genome biology. We develop sparse versions of two recently proposed PLS-based classification methods using sparse partial least squares (SPLS). These sparse versions aim to achieve variable selection and dimension reduction simultaneously. We...